Abstract

We present a trainable model for identifying
sentence boundaries in raw text. Given a corpus
annotated with sentence boundaries, our model
learns to classify each occurrence of ., ?, and !
as either a valid or invalid sentence boundary.
The training procedure requires no hand-crafted
rules, lexica, part-of-speech tags, or
domain-specific information. The model can
therefore be trained easily on any genre of
English, and should be trainable on any other
Roman-alphabet language. Performance is comparable
to or better than the performance of similar
systems, but we emphasize the simplicity of
retraining for new domains.

1 Introduction 

The task of identifying sentence boundaries in text 
has not received as much attention as it deserves. 
Many freely available natural language processing 
tools require their input to be divided into sentences, 
but make no mention of how to accomplish this (e.g.
(Brill, 1994; Collins, 1996)). Others perform the
division implicitly without discussing performance
(e.g. (Cutting et al., 1992)).

At first glance, it may appear that using a short
list of sentence-final punctuation marks, such as .,
?, and !, is sufficient. However, these punctuation
marks are not used exclusively to mark sentence
breaks. For example, embedded quotations may contain
any of the sentence-ending punctuation marks, and .
is used as a decimal point, in e-mail addresses, to
indicate ellipsis and in abbreviations. Both ! and ?
are somewhat less ambiguous but appear in proper
names and may be used multiple times for emphasis to
mark a single sentence boundary.

Lexically-based rules could be written and exception
lists used to disambiguate the difficult cases
described above. However, the lists will never be
exhaustive, and multiple rules may interact badly
since punctuation marks exhibit absorption
properties. Sites which logically should be marked
with multiple punctuation marks will often only have
one ((Nunberg, 1990) as summarized in (White, 1995)).
For example, a sentence-ending abbreviation will
most likely not be followed by an additional period
if the abbreviation already contains one (e.g. note
that D.C is followed by only a single . in The
president lives in Washington, D.C.).

As a result, we believe that manually writing rules 
is not a good approach. Instead, we present a solu- 
tion based on a maximum entropy model which re- 
quires a few hints about what information to use and 
a corpus annotated with sentence boundaries. The 
model trains easily and performs comparably to sys- 
tems that require vastly more information. Training 
on 39441 sentences takes 18 minutes on a Sun Ultra 
Sparc and disambiguating the boundaries in a single 
Wall Street Journal article requires only 1.4 seconds. 

2 Previous Work 

To our knowledge, there have been few papers about 
identifying sentence boundaries. The most recent 
work will be described in (Palmer and Hearst, To
appear). There is also a less detailed description of
Palmer and Hearst's system, SATZ, in (Palmer and
Hearst, 1994).¹ The SATZ architecture uses either
a decision tree or a neural network to disambiguate
sentence boundaries. The neural network achieves
98.5% accuracy on a corpus of Wall Street Journal
articles using a lexicon which includes part-of-speech
(POS) tag information. By increasing the quantity
of training data and decreasing the size of their test
corpus, Palmer and Hearst achieved performance of
98.9% with the neural network. They obtained similar
results using the decision tree. All the results we
will present for our algorithms are on their initial,
larger test corpus.

1 We recommend these articles for a more comprehensive
review of sentence-boundary identification work
than we will be able to provide here.



In (Riley, 1989), Riley describes a decision-tree 
based approach to the problem. His performance on 
the Brown corpus is 99.8%, using a model learned 
from a corpus of 25 million words. Liberman and 
Church suggest in (Liberman and Church, 1992)
that a system could be quickly built to divide
newswire text into sentences with a nearly negligible 
error rate, but do not actually build such a system. 

3 Our Approach 

We present two systems for identifying sentence 
boundaries. One is targeted at high performance 
and uses some knowledge about the structure of En- 
glish financial newspaper text which may not be ap- 
plicable to text from other genres or in other lan- 
guages. The other system uses no domain-specific 
knowledge and is aimed at being portable across En- 
glish text genres and Roman alphabet languages. 

Potential sentence boundaries are identified by 
scanning the text for sequences of characters sep- 
arated by whitespace (tokens) containing one of the 
symbols !, . or ?. We use information about the
token containing the potential sentence boundary, as
well as contextual information about the tokens im- 
mediately to the left and to the right. We also con- 
ducted tests using wider contexts, but performance 
did not improve. 
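
The scanning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_candidates` and the tuple layout are ours, not part of the system described here, and tokens are taken to be whatever `str.split()` returns.

```python
BOUNDARY_SYMBOLS = ".?!"


def find_candidates(text):
    """Return one (left, token, position, right) tuple per potential boundary.

    `token` is the whitespace-delimited token holding the symbol, `position`
    is the index of the symbol inside that token, and `left`/`right` are the
    neighbouring tokens (None at the edges of the text).
    """
    tokens = text.split()
    candidates = []
    for i, token in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None
        for position, char in enumerate(token):
            if char in BOUNDARY_SYMBOLS:
                candidates.append((left, token, position, right))
    return candidates


example = "ANLP Corp. chairman Dr. Smith resigned."
for candidate in find_candidates(example):
    print(candidate)
```

A token with several symbols, such as D.C., yields one candidate per symbol, so each . is classified separately.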

We call the token containing the symbol which 
marks a putative sentence boundary the Candidate. 
The portion of the Candidate preceding the poten- 
tial sentence boundary is called the Prefix and the 
portion following it is called the Suffix. The system 
that focused on maximizing performance used the 
following hints, or contextual "templates":

• The Prefix 

• The Suffix 

• The presence of particular characters in the Pre- 
fix or Suffix 

• Whether the Candidate is an honorific (e.g. 
Ms., Dr., Gen.) 

• Whether the Candidate is a corporate designa- 
tor (e.g. Corp., S.p.A., L.L.C.) 



• Features of the word left of the Candidate 

• Features of the word right of the Candidate 

The templates specify only the form of the in- 
formation. The exact information used by the 
maximum entropy model for the potential sentence 
boundary marked by . in Corp. in Example (1) would
be: PreviousWordIsCapitalized, Prefix=Corp,
Suffix=NULL, PrefixFeature=CorporateDesignator.

(1) ANLP Corp. chairman Dr. Smith resigned. 
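
As a rough sketch of how the templates could be instantiated for Example (1), the Python below splits a Candidate into Prefix and Suffix and emits the kind of information listed above. The two word lists are tiny illustrative subsets taken from the examples in the text, and the names `describe_candidate`, `PrefixFeature=Honorific` and `FollowingWordIsCapitalized` are our own assumptions rather than the system's actual inventory.

```python
# Illustrative subsets only; the real hand-crafted lists are not given here.
HONORIFICS = {"Ms.", "Dr.", "Gen."}
CORPORATE_DESIGNATORS = {"Corp.", "S.p.A.", "L.L.C."}


def describe_candidate(left, candidate, position, right):
    """Return the contextual information for one potential boundary."""
    prefix = candidate[:position]
    suffix = candidate[position + 1:] or "NULL"
    info = [f"Prefix={prefix}", f"Suffix={suffix}"]
    if candidate in HONORIFICS:
        info.append("PrefixFeature=Honorific")  # hypothetical name
    if candidate in CORPORATE_DESIGNATORS:
        info.append("PrefixFeature=CorporateDesignator")
    if left and left[0].isupper():
        info.append("PreviousWordIsCapitalized")
    if right and right[0].isupper():
        info.append("FollowingWordIsCapitalized")  # hypothetical name
    return info


# The . in Corp. from Example (1): left word ANLP, right word chairman.
print(describe_candidate("ANLP", "Corp.", 4, "chairman"))
```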

The highly portable system uses only the identity 
of the Candidate and its neighboring words, and a 
list of abbreviations induced from the training data.²
Specifically, the "templates" used are: 

• The Prefix 

• The Suffix 

• Whether the Prefix or Suffix is on the list of 
induced abbreviations 

• The word left of the Candidate 

• The word right of the Candidate 

• Whether the word to the left or right of the 
Candidate is on the list of induced abbreviations 

The information this model would use for Example (1)
would be: PreviousWord=ANLP, FollowingWord=chairman,
Prefix=Corp, Suffix=NULL,
PrefixFeature=InducedAbbreviation.

The abbreviation list is automatically produced 
from the training data, and the contextual ques- 
tions are also automatically generated by scanning 
the training data with question templates. As a re- 
sult, no hand-crafted rules or lists are required by 
the highly portable system and it can be easily re- 
trained for other languages or text genres. 
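
One possible reading of the abbreviation-induction step is sketched below in Python. It assumes the training data is available as a list of strings, one annotated sentence each, and that every sentence ends with its boundary punctuation as the final character; both the input format and the function name `induce_abbreviations` are assumptions made for illustration.

```python
def induce_abbreviations(sentences):
    """Collect tokens containing a . that is not the sentence boundary."""
    abbreviations = set()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            is_last = i == len(tokens) - 1
            # In the last token the final character is the annotated
            # boundary, so only the characters before it are inspected.
            body = token[:-1] if is_last else token
            if "." in body:
                abbreviations.add(token)
    return abbreviations


training = [
    "ANLP Corp. chairman Dr. Smith resigned.",
    "The president lives in Washington, D.C.",
]
print(sorted(induce_abbreviations(training)))
```

Under this rule any token with an internal ., including a number such as 1.4, lands on the list; the sketch makes no attempt to filter such cases.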

4 Maximum Entropy 

The model used here for sentence-boundary detection
is based on the maximum entropy model used for POS
tagging in (Ratnaparkhi, 1996). For each potential
sentence boundary token (., ?, and !), we estimate a
joint probability distribution p of the token and its
surrounding context, both of which are denoted by c,
occurring as an actual sentence boundary. The
distribution is given by:

$$p(b,c) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(b,c)}$$

where $b \in \{no, yes\}$, where the $\alpha_j$'s are
the unknown parameters of the model, and where each
$\alpha_j$ corresponds to a feature $f_j$. Thus the
probability of seeing an actual sentence boundary in
the context c is given by p(yes, c).

2 A token in the training data is considered an
abbreviation if it is preceded and followed by
whitespace, and it contains a . that is not a sentence
boundary.

The contextual information deemed useful for 
sentence-boundary detection, which we described 
earlier, must be encoded using features. For exam- 
ple, a useful feature might be: 



$$f_j(b,c) = \begin{cases} 1 & \text{if } \mathrm{Prefix}(c) = \text{Mr and } b = \text{no} \\ 0 & \text{otherwise} \end{cases}$$



This feature will allow the model to discover that the 
period at the end of the word Mr. seldom occurs as 
a sentence boundary. Therefore the parameter cor- 
responding to this feature will hopefully boost the 
probability p(no, c) if the Prefix is Mr. The param- 
eters are chosen to maximize the likelihood of the 
training data using the Generalized Iterative Scaling
(Darroch and Ratcliff, 1972) algorithm.
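
To make the parameter estimation step more concrete, here is a minimal Python sketch of a Generalized Iterative Scaling loop. It is not the implementation used for the experiments reported here, and it simplifies in three labelled ways: it fits the conditional form p(b|c) over the contexts seen in training rather than a full joint distribution, it assumes every context carries the same number of attributes (which supplies the constant C), and it omits the correction feature that the algorithm normally uses for unseen (outcome, attribute) pairs. The weights are stored as log alpha_j, and the toy data and all function names are invented for illustration.

```python
import math
from collections import defaultdict

OUTCOMES = ("no", "yes")


def conditional(weights, context):
    """Return {outcome: p(outcome | context)} under the current weights."""
    scores = {
        b: math.exp(sum(weights.get((b, attr), 0.0) for attr in context))
        for b in OUTCOMES
    }
    total = sum(scores.values())
    return {b: s / total for b, s in scores.items()}


def train_gis(samples, iterations=100):
    """samples: list of (context, outcome); context is a tuple of strings."""
    sizes = {len(context) for context, _ in samples}
    assert len(sizes) == 1, "sketch assumes equally sized contexts"
    constant = sizes.pop()  # the GIS constant C

    # Observed count of each binary feature (outcome, attribute).
    observed = defaultdict(float)
    for context, outcome in samples:
        for attr in context:
            observed[(outcome, attr)] += 1.0

    weights = {feature: 0.0 for feature in observed}  # log(alpha_j)
    for _round in range(iterations):
        # Expected count of each feature under the current model.
        expected = defaultdict(float)
        for context, _outcome in samples:
            probs = conditional(weights, context)
            for b in OUTCOMES:
                for attr in context:
                    if (b, attr) in weights:
                        expected[(b, attr)] += probs[b]
        for feature in weights:
            ratio = observed[feature] / expected[feature]
            weights[feature] += math.log(ratio) / constant
    return weights


# Invented toy data: (Prefix attribute, following-word attribute) -> outcome.
toy = (
    [(("Prefix=Mr", "NextCap"), "no")] * 3
    + [(("Prefix=Corp", "NextCap"), "yes")] * 2
    + [(("Prefix=Corp", "NextLower"), "no")]
    + [(("Prefix=resigned", "NextCap"), "yes")] * 2
)
model = train_gis(toy)
print(conditional(model, ("Prefix=Mr", "NextCap")))
```

On this toy data the weight for the (no, Prefix=Mr) feature grows with every round, which is the behaviour the Mr. example in the text describes.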

The model also can be viewed under the Maximum
Entropy framework, in which we choose a distribution
p that maximizes the entropy H(p)

$$H(p) = -\sum_{b,c} p(b,c) \log p(b,c)$$

under the following constraints:

$$\sum_{b,c} p(b,c) f_j(b,c) = \sum_{b,c} \tilde{p}(b,c) f_j(b,c), \quad 1 \le j \le k$$

where $\tilde{p}(b,c)$ is the observed distribution of
sentence-boundaries and contexts in the training data.
As a result, the model in practice tends not to commit
towards a particular outcome (yes or no) unless it
has seen sufficient evidence for that outcome; it is
maximally uncertain beyond meeting the evidence.

All experiments use a simple decision rule to clas- 
sify each potential sentence boundary: a potential 
sentence boundary is an actual sentence boundary if 
and only if p(yes|c) > .5, where 



$$p(yes|c) = \frac{p(yes,c)}{p(yes,c) + p(no,c)}$$



and where c is the context including the potential 
sentence boundary. 
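
The decision rule can be illustrated with a short Python sketch. It assumes the trained parameters are available as a dictionary from (outcome, attribute) pairs to alpha values, with missing pairs treated as alpha = 1; the parameter values below are invented, and the function names are ours. Because the constant factor of the joint model appears in both numerator and denominator of p(yes|c), the sketch works with unnormalized products.

```python
def joint_score(alphas, outcome, context):
    """Unnormalized p(outcome, c): product of alpha_j over active features."""
    score = 1.0
    for attr in context:
        score *= alphas.get((outcome, attr), 1.0)
    return score


def is_boundary(alphas, context):
    """Apply the rule p(yes|c) > .5 and also return p(yes|c)."""
    yes = joint_score(alphas, "yes", context)
    no = joint_score(alphas, "no", context)
    p_yes = yes / (yes + no)
    return p_yes > 0.5, p_yes


# Invented parameter values, for illustration only.
alphas = {
    ("no", "Prefix=Mr"): 8.0,
    ("yes", "FollowingWordIsCapitalized"): 2.0,
}
print(is_boundary(alphas, ["Prefix=Mr", "FollowingWordIsCapitalized"]))
print(is_boundary(alphas, ["Prefix=resigned", "FollowingWordIsCapitalized"]))
```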

5 System Performance 

We trained our system on 39441 sentences (898737
words) of Wall Street Journal text from sections
00 through 24 of the second release of the Penn
Treebank³ (Marcus, Santorini, and Marcinkiewicz,
1993). We corrected punctuation mistakes and
erroneous sentence boundaries in the training data.
Performance figures for our best performing system,
which used a hand-crafted list of honorifics and
corporate designators, are shown in Table 1. The first
test set, WSJ, is Palmer and Hearst's initial test
data and the second is the entire Brown corpus. We
present the Brown corpus performance to show the
importance of training on the genre of text on which
testing will be performed. Table 1 also shows the
number of sentences in each corpus, the number of
candidate punctuation marks, the accuracy over
potential sentence boundaries, the number of false
positives and the number of false negatives.
Performance on the WSJ corpus was, as we expected,
higher than performance on the Brown corpus since we
trained the model on financial newspaper text.

3 We did not train on files which overlapped with
Palmer and Hearst's test data, namely sections 03, 04,
05 and 06.

                               WSJ      Brown
Sentences                      20478    51672
Candidate Punctuation Marks    32173    61282
Accuracy                       98.8%    97.9%
False Positives                201      750
False Negatives                171      506

Table 1: Our best performance on two corpora.

Possibly more significant than the system's
performance is its portability to new domains and
languages. A trimmed down system which used no
information except that derived from the training
corpus performs nearly as well, and requires no
resources other than a training corpus. Its
performance on the same two corpora is shown in
Table 2.

Test Corpus    Accuracy    False Positives    False Negatives
WSJ            98.0%       396                245
Brown          97.5%       1260               265

Table 2: Performance on the same two corpora using
the highly portable system.
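
The accuracy figures in Tables 1 and 2 can be related to the error counts with a small calculation, sketched here in Python. It assumes that accuracy is the fraction of candidate punctuation marks classified correctly, and that the candidate counts of Table 1 also apply to Table 2 because the test corpora are the same; neither assumption is stated outright in the text. The recomputed values agree with the reported ones to within about 0.05 percentage points, so small rounding differences should be expected.

```python
def accuracy(candidates, false_positives, false_negatives):
    """Fraction of candidate punctuation marks classified correctly."""
    return (candidates - false_positives - false_negatives) / candidates


# (candidate marks, false positives, false negatives) per test corpus.
BEST_PERFORMING = {"WSJ": (32173, 201, 171), "Brown": (61282, 750, 506)}
HIGHLY_PORTABLE = {"WSJ": (32173, 396, 245), "Brown": (61282, 1260, 265)}

for name, table in (("best", BEST_PERFORMING), ("portable", HIGHLY_PORTABLE)):
    for corpus, counts in table.items():
        print(f"{name:8s} {corpus:5s} {100 * accuracy(*counts):.2f}%")
```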

Since 39441 training sentences is considerably
more than might exist in a new domain or a language
other than English, we experimented with the
quantity of training data required to maintain
performance. Table 3 shows performance on the WSJ
corpus as a function of training set size using the
best performing system and the more portable system.
As can be seen from the table, performance degrades
as the quantity of training data decreases, but even
with only 500 example sentences performance is better
than the baselines of 64.0% if a sentence boundary
is guessed at every potential site and 78.4% if
only token-final instances of sentence-ending
punctuation are assumed to be boundaries.

                   Number of sentences in training corpus
                   500      1000     2000     4000     8000     16000    39441
Best performing    97.6%    98.4%    98.0%    98.4%    98.3%    98.3%    98.8%
Highly portable    96.5%    97.3%    97.3%    97.6%    97.6%    97.8%    98.0%

Table 3: Performance on Wall Street Journal test data as a function of training set size for both systems.
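
The two baselines quoted in this section are simple enough to sketch directly in Python. The candidate representation (token, position of the symbol), the function names and the five labelled toy candidates are all invented for illustration; the percentages the sketch prints describe the toy data only and have no relation to the 64.0% and 78.4% measured on the WSJ test set.

```python
def guess_every_site(candidates):
    """Baseline 1: every potential site is called a sentence boundary."""
    return [True for _ in candidates]


def guess_token_final(candidates):
    """Baseline 2: only a symbol ending its token is called a boundary."""
    return [position == len(token) - 1 for token, position in candidates]


def score(predictions, gold):
    """Fraction of candidates whose prediction matches the annotation."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy candidates from: "ANLP Corp. chairman Dr. Smith resigned.
# Shares rose 1.4 percent."
candidates = [("Corp.", 4), ("Dr.", 2), ("resigned.", 8), ("1.4", 1), ("percent.", 7)]
gold = [False, False, True, False, True]

print(score(guess_every_site(candidates), gold))
print(score(guess_token_final(candidates), gold))
```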

6 Conclusions 

We have described an approach to identifying sen- 
tence boundaries which performs comparably to 
other state-of-the-art systems that require vastly 
more resources. For example, Riley's performance 
on the Brown corpus is higher than ours, but his sys- 
tem is trained on the Brown corpus and uses thirty 
times as much data as our system. Also, Palmer
& Hearst's system requires POS tag information,
which limits its use to those genres or languages for
which there are either POS tag lexica or POS tag
annotated corpora that could be used to train auto- 
matic taggers. In comparison, our system does not 
require POS tags or any supporting resources be- 
yond the sentence-boundary annotated corpus. It 
is therefore easy and inexpensive to retrain this sys- 
tem for different genres of text in English and text in 
other Roman-alphabet languages. Furthermore, we 
showed that a small training corpus is sufficient for 
good performance, and we estimate that annotating 
enough data to achieve good performance would re- 
quire only several hours of work, in comparison to 
the many hours required to generate POS tag and 
lexical probabilities. 